HL7 Gen v2 Synthetic Data Generation

This notebook implements a driver program for the HL7 synthetic data generation software. It also generates plots of the marginal distributions for all variables, the original and synthetic timeseries, and convergence plots for the correlation matrix elements.

User Inputs

1. Specify the Jurisdiction

The jurisdiction is a string such as "California", "Rhode Island", or "New York". Common state abbreviations can also be used: "CA", "RI", or "NY". The jurisdiction string can specify one of these non-state jurisdictions as well:

Quotation marks are required.

2. Specify the Condition Code

The HL7 data has been organized into a file tree with a folder for each jurisdiction. Each jurisdiction folder contains the data files for all reportable conditions in that jurisdiction. The data files are named with the condition code and a .csv extension.

The condition code should be a five-digit Python integer (only digits, no quotes).
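Together, the first two inputs might look like the cell below. CONDITION_CODE is named later in this document; JURISDICTION is an assumed variable name for the jurisdiction string, used here only for illustration.

```python
# Sketch of the first two user inputs.
# JURISDICTION is an assumed variable name; CONDITION_CODE is named in the text.
JURISDICTION = "CA"        # state name or common abbreviation, quoted
CONDITION_CODE = 10680     # five-digit condition code: digits only, no quotes
```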

3. Condition Grouping

The data for some conditions should be combined with the data for related conditions. For instance, the data for all hemorrhagic fevers in a given jurisdiction should be combined together prior to synthetic data generation.

The default behavior is to group datasets according to the CDC's specifications. Use the DISABLE_GROUPING Boolean variable to disable dataset grouping. A Python Boolean variable is either True or False (no quotes).

Special Handling for Syphilis

The synthetic data generation software can determine the data to be grouped for every condition except Syphilis. These Syphilis codes:

can be members of either of these two groups:

If one of the codes 10310, 10311, or 10312 has been specified with the CONDITION_CODE variable, use the SYPHILIS_TOTAL Boolean to indicate whether the Syphilis Total grouping is desired.

This Boolean variable is ignored if DISABLE_GROUPING = True, as well as for any codes other than 10310, 10311, and 10312.
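The rule above can be restated as a sketch. Only the two Boolean names come from the text; the helper function is hypothetical.

```python
# DISABLE_GROUPING and SYPHILIS_TOTAL are the Booleans described above;
# the helper is a hypothetical restatement of when SYPHILIS_TOTAL matters.
DISABLE_GROUPING = False
SYPHILIS_TOTAL = False

def syphilis_total_applies(condition_code, disable_grouping, syphilis_total):
    """SYPHILIS_TOTAL is consulted only when grouping is enabled and the
    condition code is one of the three ambiguous Syphilis codes."""
    ambiguous = condition_code in (10310, 10311, 10312)
    return (not disable_grouping) and ambiguous and syphilis_total
```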

4. Specify the Output File Name

The synthetic data will be written to a file in the OUTPUT_DIR folder (see below). The format can be either CSV or JSON; CSV is the default. The next variable is a string that specifies the name of this output file (no extension needed). Since this is a Python string, surround it with either single or double quotes.

If an output file name is not specified, a default name will be generated according to this format: synthetic_<code_list>_<jurisdiction>.csv.

For instance, the default output file name for Dengue (code 10680) in California is synthetic_10680_ca.csv.

The default output file name for Syphilis Group 1 (codes 10310, 10311, and 10312) in California is synthetic_10310_10311_10312_ca.csv, assuming data for all three codes exists for this jurisdiction. If no data for code 10310 is present, the default name is synthetic_10311_10312_ca.csv.

The default output file name for Syphilis Group 2 in California is synthetic_syphilis_total_ca.csv.

To use the default name, set the variable OUTPUT_FILE_NAME to the special Python value None (no quotes).
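The default-name rules above can be sketched as follows. The function name is hypothetical, and the jurisdiction is assumed to have already been reduced to its lowercase abbreviation (e.g. "ca").

```python
def default_output_name(codes, jurisdiction_abbrev, syphilis_total=False):
    """Hypothetical sketch of the default output file name rules above."""
    if syphilis_total:
        return f"synthetic_syphilis_total_{jurisdiction_abbrev}.csv"
    # Join the codes that actually have data, in ascending order.
    code_list = "_".join(str(c) for c in sorted(codes))
    return f"synthetic_{code_list}_{jurisdiction_abbrev}.csv"
```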

5. Optional Inputs

These variables will rarely need to be changed after configuring them for the initial run.

END OF USER INPUTS

All code below this point runs without additional user input.

To run the notebook and generate synthetic data for the given condition and jurisdiction: from the Kernel menu, select Restart & Clear Output, then select Restart & Run All.

Import Required Libraries

An import section is standard for all Python programs. Here we import basic system libraries, numpy, plotting routines, and several custom modules developed for this project.

Enable Debug Output

If the user has requested debug output, it is enabled first:

Construct the List of Input Files

The next task is to use the user-specified variables to determine a list of files to be loaded and processed.

We use the condition code to find out if grouping is required, and if so, to obtain all codes for the grouped conditions.

Next we use the list of codes to construct a list of fully-qualified input filepaths:
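A minimal sketch of this step, assuming the folder layout described in section 2 (one folder per jurisdiction, one code.csv file per condition); the function name is hypothetical.

```python
from pathlib import Path

def build_input_paths(data_root, jurisdiction, codes):
    """Return one fully-qualified input path per grouped condition code."""
    jur_dir = Path(data_root) / jurisdiction
    return [jur_dir / f"{code}.csv" for code in codes]
```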

Construct the Path to the Output File

The next task is to take the supplied output file name and construct a proper Python path for it. A .csv extension will be added if not already present.
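A sketch of the extension handling, assuming the OUTPUT_DIR variable from the optional inputs; JSON names are left untouched here since JSON is a supported output format.

```python
from pathlib import Path

def build_output_path(output_dir, file_name):
    """Add a .csv extension when the name carries no recognized extension."""
    if not file_name.lower().endswith((".csv", ".json")):
        file_name += ".csv"
    return Path(output_dir) / file_name
```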

Initialize the Random Number Generator

Create the RNG and seed it either from the supplied seed or from the system entropy pool:
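With numpy this is one line: default_rng falls back to the OS entropy pool when the seed is None. RNG_SEED is an assumed variable name for the optional seed input.

```python
import numpy as np

RNG_SEED = None                        # assumed name; None -> OS entropy pool
rng = np.random.default_rng(RNG_SEED)  # numpy Generator used for all sampling
```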

Check the Sample Count

BEGIN SYNTHETIC DATA GENERATION

The first thing to do is to capture the start time:

Load Input Files

The next step is to load and merge the input files, initialize all data structures, etc. A full model will be created and initialized. The file loader prints information about the data after loading.

Plot Marginal Distributions

It is helpful to look at plots of the marginal distributions for each variable. These plots reveal whether any categorical variables are concentrated in only a few values, indicating that those variables may be essentially uncorrelated with the others.

The AGE and COUNTY plots are drawn much larger than the other plots, to overcome any resolution artifacts. The AGE variable has 122 possibilities, and the COUNTY variable could have as many as 255 values for the state of Texas.

The AGE_UNKNOWN value has been remapped from 999 to -1, to make it contiguous with the other AGE values.
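The remapping can be sketched as follows (the function name is hypothetical):

```python
import numpy as np

def remap_age_unknown(ages, sentinel=999, replacement=-1):
    """Map the AGE_UNKNOWN sentinel (999) to -1 so the unknown bin is
    contiguous with the ordinary AGE values."""
    ages = np.asarray(ages).copy()
    ages[ages == sentinel] = replacement
    return ages
```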

Plot the Empirical Cumulative Distribution Functions and their Inverses

It is also instructive to look at the empirical cumulative distribution functions (ECDF) and their inverses. The copula model uses the inverse ECDFs to partition the values for each categorical variable appropriately. The ECDF and inverse ECDF functions should all be monotonic, smooth, and well-behaved.

Signal Processing

With the data successfully loaded the synthetic timeseries can be generated.

The next function computes the number of case reports per day from the available data. The number of case reports per day is the "signal", which will be modified by the signal processing code.
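The project's own signal-derivation module is not shown in this chunk; as a sketch, counting case reports per day (with empty days filled with zero) can be done with pandas. The function name is hypothetical.

```python
import pandas as pd

def reports_per_day(report_dates):
    """Count case reports per calendar day, filling empty days with zero."""
    s = pd.Series(1, index=pd.to_datetime(report_dates))
    return s.resample("D").sum()
```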

Generate Synthetic Fourier Result with Adaptive Amplitude Noise

With the timeseries available, the next step is to modify it from the original by adding noise. The adaptive noise generator will vary the amount of noise based on the local characteristics of the signal.

Modify Very Sparse Timeseries

Results for very sparse datasets can be improved by modifying the values of some of the nonzero counts and slightly changing their positions in the timeseries. The next function call performs these modifications.

Plot the Signal and the Individual Date Traces

Plots of the different date types vs. time can be useful for assessing the completeness of the original data. The signal derived above appears in the first subplot. Subsequent subplots show the number of case reports sharing common date values for each different date type.

Plot the Original and Synthetic Timeseries

It is helpful to look at a plot of the original timeseries data for the COUNT variable, along with the synthetic timeseries and a difference signal. Visual inspection of the plots reveals whether the signal processing code was successful at generating a realistic result.

Compute the Number of Synthetic Categorical Tuples to Generate

The synthetic timeseries provides a value of the COUNT variable for each day. These counts need to be summed so that the copula model will know how many synthetic categorical tuples to generate. This value can be overridden by the user if desired via the NUM_SAMPLES variable.

It is important to note that the sum of all COUNT values in the synthetic timeseries will likely differ from the same quantity in the original data.
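The count-summing step can be sketched as follows. NUM_SAMPLES is the user override named above; the timeseries here is a toy array.

```python
import numpy as np

NUM_SAMPLES = None                            # user override from the optional inputs
synthetic_counts = np.array([0, 2, 1, 0, 3])  # toy synthetic timeseries (COUNT per day)

# Total tuples for the copula model: sum of daily counts, unless overridden.
num_tuples = int(synthetic_counts.sum()) if NUM_SAMPLES is None else NUM_SAMPLES
```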

Run the Copula Model

With the number of samples determined the copula model can run. The copula model generates tuples of synthetic values for the categorical variables.

Generate Pseudoperson Date Tuples

The next function generates a list of date tuples from which the synthetic correlated dates will be derived. There is one entry in date_tuple_list for each synthetic sample.

Write the Output File

At this point all synthetic data has been generated, so the output file can be written:

Determine Elapsed Time

It is also useful to know the overall runtime, which includes the time required to generate the plots above:
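A sketch of the timing bookkeeping, using the standard library:

```python
import time

start_time = time.perf_counter()   # captured before generation begins
# ... synthetic data generation and plotting run here ...
elapsed = time.perf_counter() - start_time
print(f"Elapsed time: {elapsed:.2f} s")
```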

END OF SYNTHETIC DATA GENERATION

Load the Output File and Generate Correlation Matrix Element Convergence Plots

Checking the convergence behavior of the Kendall's tau correlation matrix elements is also useful. The plot below provides an indication of how well the rank correlations present in the original data are preserved in the synthetic data.

The code below checks the correlations for these variables: AGE, SEX, RACE, ETHNICITY, CASE_STATUS, and COUNTY. A 6D model has a 6x6 tau correlation matrix, for a total of 36 elements. The six elements along the diagonal all have the value 1.0. The matrix is also symmetric, which means that there are (36-6)/2 == 15 independent matrix elements. These elements will be plotted below in a plot containing three rows of five subplots each.
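The element counting above can be restated in code; numpy's triu_indices yields the index pairs of the 15 unique off-diagonal elements.

```python
import numpy as np

n = 6                                    # AGE, SEX, RACE, ETHNICITY, CASE_STATUS, COUNTY
total_elements = n * n                   # 36 elements in the full matrix
independent = (total_elements - n) // 2  # symmetric with unit diagonal -> 15
rows, cols = np.triu_indices(n, k=1)     # (row, col) pairs above the diagonal
```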

This is the procedure for generating the convergence plots:

Load the Output File and Build Data Structures

Plot the Marginal Distributions for the Synthetic Data

The marginal distributions for the synthetic data should look *similar* to those of the original data set. The similarity increases as the number of synthetic data samples increases.

Plot the Empirical Cumulative Distribution Functions and their Inverses for the Synthetic Data

Derive the Synthetic Signal from the Synthetic Data

This next code block derives the synthetic signal (number of case reports vs. time) from the synthetic data, using the same techniques as for the original signal above.

Plot the Synthetic Timeseries

Plot the original signal and the synthetic signal loaded from the synthetic data file. This is a check on the signal to see that the dates were generated and written correctly.

Compute a 6x6 Submatrix of the Kendall's Tau Rank Correlation Matrix for Varying Sample Sizes

With the data loaded, we take slices of length 32, 64, 128, etc. from the synthetic data arrays, compute a 6x6 tau correlation matrix, and save the values of the matrix elements. The variables chosen for the correlation matrix in this section have analogs in the NETSS synthetic data.
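A sketch of the convergence computation for a single matrix element, using scipy's kendalltau; the doubling schedule matches the text, and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import kendalltau

def tau_vs_sample_size(x, y, start=32):
    """Kendall's tau between two variables at sample sizes 32, 64, 128, ..."""
    sizes, taus = [], []
    n = start
    while n <= len(x):
        tau, _ = kendalltau(x[:n], y[:n])
        sizes.append(n)
        taus.append(tau)
        n *= 2
    return sizes, taus
```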

Plot the Correlation Matrix Elements vs. Sample Size

IMPORTANT: The correlation plots may exhibit a dependence on the random number generator seed. This is especially true for smaller datasets (a few hundred points or fewer). Kendall's tau measures rank correlations in the categorical variable values; different sequences of random numbers will affect these correlations, and the effects are more pronounced for smaller datasets.

The best way to determine whether the copula model preserves rank correlations is to generate more samples and look at the correlation plots. Larger sample counts tend to wash out the spurious correlations that appear in smaller data sets. Be careful if you do this, though: the output file writer will repeat the synthetic timeseries into the future for sample counts that exceed the number required by the copula model. For very sparse datasets, the maximum HL7 date of 9999-12-31 could be reached, in which case the software is forced to stop at however many samples it has written at that point.

Convergence is also affected by unbalanced marginal distributions.

The dots in each plot appear at powers of 2 starting with 32, 64, 128, ... up to a maximum of 262144. This produces equally-spaced points on the x-axis, which has a log scale. If the number of synthetic samples is not a power of two, the final point will not appear evenly spaced with the others.